File-level malware detection using byte streams

As more documents appear on the Internet, it becomes important to detect malware within the documents. Malware of non-executables might be more dangerous because people usually open them without worrying about inherent danger. Recently, deep learning models are used to analyze byte streams of the non-executables for malware detection. Although they have shown successful results, they are commonly designed for stream-level detection, but not for file-level detection. In this paper, we propose a new method that aggregates the stream-level results to get file-level results for malware detection. We demonstrate its effectiveness by experimental results with our annotated dataset, and show that it gives performance gain of 3.37–5.89% of F1 scores.

www.nature.com/scientificreports/ from 0x0000 to 0x0007 in the header block, and it is usually a sequence of [D0, CF, 11, E0, A1, B1, 1A, E1] which is a signature keyword indicating that it is an OLE file. A data block is more than 512 bytes and has properties: stream data, Big Block Allocation Table (BBAT), and Small Block Allocation Table (SBAT). The properties hold information about files or folders in the device. BBAT is a link-type structure that includes stream location information inside OLE, and increases as the OLE file grows. SBAT stores a small area of data when entering a document. Stream data is the most important in an OLE file and takes up most of the data blocks. To extract byte streams from files, we can use python libraries such as olefile, zlib, BytesIO, and struct. The file header with olefile.OleFileIO and openstream functions are firstly extracted, and it is necessary to check file information such as properties, encryption, compression, and script inclusion. Once the list of streams in the OLE file is obtained, every stream is decompressed.
There have been studies of malware detection on MS office files. Yang et al. 4 proposed a method for detecting MS-DOC malware using CNN models. Through static analysis, they found that the malicious MS-DOC files often have irregular file names and sensitive API calls. They showed that the malicious MS-DOC contains an encryption code to evade malware detection and a large number of meaningless characters. They converted MS-DOC files into 1024×1024 gray images and processed them as input image. The results showed that the model had an average accuracy of 94.70%. Mimura 5 analyzed malicious MS office non-executable documents (e.g., .doc, .docx, .xls, .xlsx, .ppt, and .pptx) using language models. Malicious MS office documents were collected from VirusTotal and normal MS office documents were gathered from Stack overflow 6 . As most malicious MS office documents contain malicious VBA macros, this study checked if functions related to encoding, replacing, or splitting were included in VBA macro. Streams containing VBA macros are used to detect malware using language models and classifiers. Aishwarya et al. 7 created MS office documents through Apache Poor Obfuscation Implementation (POI) and put some macro into randomly selected files. Apache POI allows to read or write MS Office file format in Java language, and supports Word, Excel, PowerPoint and the Open Office XML (OOXML) files (e.g., .docx, .xlsx, and .pptx). They analyzed the file structures of the complex file binary format (e.g., .doc, .ppt, and .xls) and OOXML format, and extracted features using oletools that is a Python package for malware analysis on MS Office documents. The feature set consists of more than 20 fields, and some features are based on fields such as Macro, AutoOpen, Suspicious, IOCs, HexStrings, Base64, Legitmate, Richtext, and DDElink. They exploited machine learning models such as random forest, Gradient boost and Ada boost algorithm for malware classification, and the random forest had the best accuracy of 96%. Most previous studies including above works mainly utilized a customized feature set obtainable from target files, but there are emerging recent studies that directly analyzes byte streams within files using deep learning techniques. In the next subsection, such studies are summarized and their common limitation is explained.
Malware detection on byte streams. Malware detection on non-executables (e.g., MS Word file) is basically a classification task on two classes (e.g., normal and malware). There are two categories of approaches for the task: static and dynamic. The dynamic approach is to exploit a separated platform or an isolated virtual environment (e.g., virtual box), and analyzes step-by-step actions of a suspicious program. The existing studies of this approach have a common limitation that they are not reproducible as they usually exploit different emulations. On the other hand, the static approach is to directly analyze the target file without running it. Therefore, considering that we face more data everyday, the static approach will be preferable as it is relatively more efficient.
There have been studies of the static approach that uses handcrafted features and various machine-learning (ML) models such as support vector machines (SVM) 8 , logistic regression (LR), random forest (RF) 9 , and extreme gradient boosting (XGB) 10 . For example, Ranveer and Hiray 11 trained the SVM with frequency-based patterns obtainable from executables, and it gave 95% of true positive rate (TPR). Morales-Molina et al. 12 defined features on portable executable (PE) files and OpCode sequences, and achieved 89 and 96% of F1 scores using the RF model. Ajit Kumar et al. 13 compared various ML models (e.g., decision tree, random forest, k-nearest neighbors, logistic regression, naive bayes, and linear discriminant analysis), and the best accuracy was 89.23%. Although the ML models have shown quite successful results, they have a common limitation that they require a huge effort of domain experts to find meaningful features.
Deep-learning (DL) technique is one of solutions for this limitation because it is able to learn underlying patterns or features automatically from given data. Especially, few studies exploited convolutional neural networks (CNN) for malware detection or classification, where CNN is known to effectively extract arbitrary local patterns. For example, Thosar et al. 14 proposed a hybrid approach of gradient boosting and CNN for malware family classification, and achieved 93.53% of F1 score. Another studies applied the CNN models directly to the byte streams within non-executables. Raff et al. 1 proposed a new shallow CNN architecture that takes as input a byte stream of PE header. Chen et al. 15 formulated that the bytes in files are image pixels, and designed a CNN model for malware detection by capturing local patterns in the images. Jeong et al. 2 used spatial pyramid pooling with average operation; it dramatically reduced the number of trainable parameters compared to other existing models without losing effectiveness (e.g., accuracy). In another paper of theirs 16 , they found that byte streams of different file formats (e.g., hangul word processor (HWP), portable document format (PDF)) allow CNN models to better learn underlying patterns for malware detection. These studies have shown potential of CNN models with byte streams, but they have a common drawback that their models work in stream-level. Figure 1 depicts how stream-level malware detection works. Each file contains arbitrary number of byte streams, and some streams may have malicious actions. Even if a file has only a single malicious stream, the file should be regarded as malware. With the stream-level model M S in Fig. 1, it misses only a single stream (i.e., the last stream) out of 7 streams, but it actually gave wrong prediction for the file 'B' out of two files; this implies that the model is poor www.nature.com/scientificreports/ for file-level prediction although it may seem fairly good for stream-level prediction. As far as we know, our paper is the first study of file-level malware detection for non-executables using byte streams.

Method
Malware detection services are supposed to provide how likely a given file (or a set of files) contains malicious actions. Even if the file has only a single suspicious byte stream, the services must warn users about the danger. This implies that the services work in file-level, whereas previous studies using byte streams focused on developing stream-level prediction models (e.g., MalConv 1 , SPAP 2 ). It is possible, of course, the stream-level models can be used for file-level prediction, as shown in Fig. 1; the model firstly generates per-stream predictions, and the file-level result will be 'malware' only if there is one or more 'malware' predictions in the per-stream results. However, the prediction model is designed to take as input a byte stream and generates a prediction, so it does not see file-level patterns (i.e., relation between byte streams). Such gap between the prediction model and services will probably lead to poor results (e.g., lower accuracy). When we simply try to train the stream-level models to work in file-level, we face a challenging issue that the models must be applicable to varied number of streams. For example, in Fig. 1, the stream-level model M S takes as input four streams when it learns from file ' A' , whereas it takes three streams for file 'B'; but it is impossible to take such varied number of streams as input if we use conventional architectures such as fully-connected layers, convolutional layers, or pooling layers. Table 1 shows the statistics about the varied number of streams.
As described in Fig. 2 (2) implies that the aggregate function f digests V s (i.e., set of arbitrary number of vectors) and gives a vector v f ∈ R n . Finally, the output (i.e., two-dimensional vector or a probability of malware) is obtained from M F that takes the v f as input. Algorithm 1 provides a formal description of the steps.   Fig. 2. For example, if we deal with the file ' A' , then the four streams with different lengths are firstly padded, so they become to have the same length. Here, we generalize the functionality of M S that takes each stream as input and gives m-dimensional vector as output. If M S is assumed to be a multi-layered network, then it will give vectors of different dimensions as we pick different layers; for example, when M S has three layers of 100, 50, and 2 dimensions, then m might be 100, 50, or 2 depending on which layer we determined to generate output of M S . When we obtain four vectors of m dimensions from the four streams of file ' A' , then they are passed into the aggregate function f that generates n-dimensional representation as output. While there are many options (e.g., hash function) as the aggregate function, we chose 'average' function that generates an averaged vector from the varied number of vectors; m = n when we use the average function because the function generates an averaged value for each dimension.
File-level classifier. With the n-dimensional vector as a feature vector generated by the aggregate function, we train a file-level classifier M F . The classifier might be any conventional models such as linear classification models, neural networks, or support vector machines. In this paper, we used a k-dimensional fully-connected layer followed by a two-dimensional output layer. The difference between M S and M F is that the file-level classifier M F is trained with a n-dimensional feature vector of every file, whereas the stream-level classifier M S is trained with a padded byte stream.
Ethical standards. This study does not involve data of human participants or animals. We have no potential conflicts of interest.

Experiments
Data and settings. We collected malware and normal files of Microsoft (MS) office. Malicious files were obtained from a Korean anti-virus company, and normal files were collected from public portal site operated by the Korean Ministry of the Interior and Safety. We extracted all streams from malicious and normal files in bytes, and if the stream is compressed, it was extracted after decompressing it. The stream data is available online. The data is split into three subsets: training, validation, and test set, as summarized in Table 2. The imbalance of streams is caused by the extremely high variance of the number of malware streams, as shown in Table 1. Streams belonging to the same file are placed in the same subset. We used a machine of Intel(R) Xeon(R) Silver 4214 CPU @ 2.20 GHz 48 cores and four graphics processing unit (GPU) of NVIDIA Quadro RTX 5000. All models are implemented using Python3 language with Tensorflow packages.
Results. We employed two recent CNN models, MalConv 1 and SPAP 2 , as M S , and compared them by performance (e.g., accuracy). The convolutional layers of MalConv have 128 dimensions with kernel size of 500, while the consecutive convolutional layers of SPAP have 64 and 264 dimensions. We made them to have the same dimension of 128 at the layer just before their output layers, and took 128-dimensional real-number vectors from the layer as the output of M S ; in other words, m = 128 as we picked the layer just before their output layer. When we train the SPAP, batch normalization 17 and drop-out techniques 18 are used for convolutional layers and fully-connected layers, respectively. For the MalConv, the drop-out is applied to its fully-connected layers. The www.nature.com/scientificreports/ parameters of MalConv and SPAP models were updated for 20 and 15 epochs, respectively, using cross-entropy loss function and Adam optimizer 19 with initial learning rate of 0.001, where the hyper-parameter setting is determined by a grid searching. We also employed a sample weight technique to mitigate the imbalance of the dataset; that is, the data instances are weighted using a ratio between two classes (i.e., malware and normal). Note that it is not our purpose to improve performance of these two stream-level models, but to prove our proposed method works better than just using the stream-level models for file-level malware detection. We used 'average' function as the aggregate function f that computes averaged value for each dimension of varied number of 128-dimensional vectors that came from the M S . We set k = 32 , so the file-level classifier M F has a 32-dimensional fully-connected layer with a drop-out followed by a 2-dimensional output layer with a softmax function. The parameters of M F were updated for 10 epochs using cross-entropy loss function and Adam optimizer with initial learning rate of 0.001. Figure 3 is a plot of training and validation loss of the file-level classifier. During the training phase of M S , the parameters of M F were not updated. Table 3 summarizes the results of file-level malware detection, where the precision, recall, and F1 scores are averages of three independent experiments. Our method exhibited a significant improvements compared to the stream-level models; for example, our method with SPAP achieved 91.12% of F1 score for malware class whereas using SPAP only gave 85.23% of F1 score. The results imply that our method generally gives 3-6% improvements   www.nature.com/scientificreports/ of F1 scores. Such a big performance gap is due to the inconsistency between the task and the models. That is, the stream-level models are designed and trained for the task of 'stream-level' prediction, so they must be poor on the 'file-level' prediction. On the other hand, our method is designed for the file-level prediction by considering relational information between streams within a given file.
When we look at precision, the models commonly suffer from low performance in normal class, probably due to the class imbalance as shown in Table 2. our method gives significant improvements in the normal class, without losing much precision of malware class; this in turn allows our method to have greater F1 scores. A similar phenomenon appears in recall, and our method again gives improvements in malware class. We also observed that our method gives better performance when it works with SPAP. This implies that using better stream-level model as M S probably contributes to better file-level performance.

Discussion
The two stream-level models, MalConv and SPAP, exhibited comparable results as shown in Table 3, while the SPAP is slightly better in terms of the F1 scores. Both models have shown better results when they work with our method. The reason of such performance gap is that our method eliminates the inconsistency between the stream-level models and the file-level task. Our aggregate function plays a crucial role in here, as it makes a fixed dimensional vector out of varied number of byte streams in each file.
Our method might be seen that we just use the stream-level models M S as feature generators and train a shallow neural network as a file-level classifier. It is worth noting that it is not trivial to use M S as feature generator because there are varied number of streams in a file; that is, the dimension of feature vector varies if we simply concatenate the output vectors of M S , so it is not possible to use it as a feature vector for the file-level classifier. Therefore, the aggregate function f plays a crucial role in our method because it allows the feature vector to have a fixed dimension.
In this paper, we basically chose the 'average' function as the aggregate function f. We also investigated some other functions to check how important the aggregate function is. Table 4 summarizes the results with two different aggregate functions: 'max' and 'min' . Interestingly, these two functions did not give much performance improvements compared to the results of stream-level models. This might be explained that the generated vectors by the stream-level models carry meaningful information of the varied number of streams, and 'min' or 'max' operation may lose such meaningful information as they keep only the smallest or greatest values.

Conclusions
In this paper, we proposed a new method of file-level malware detection. The proposed method consists of a stream-level model M S , an aggregate function f, and a file-level classifier M F . By experimental results, we demonstrated that our method works better than previous stream-level models. We trained M S firstly, and used their output vectors to generate a fixed length of feature vector using f. As a future work, we plan to design a new file-level model that does not require to train such stream-level models. We are also looking for better aggregate functions instead of the 'average' function, and perform experiments with other stream-level models.

Data availability
The stream data used in this study is available for non-commercial use: https:// sites. google. com/ arkang. net/ malwa rebyt estre ams.